Robust Cross-Lingual Genre Classification through Comparable Corpora
Authors
Abstract
Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora (here taken to be collections of texts from the same set of genres but written in different languages) are exploited to train classification models on multi-lingual text collections. The resulting genre classifiers are shown to be robust and high-performing when compared to classifiers trained on mono-lingual training sets. The work also shows that comparable corpora can be used to identify features that are indicative of genre in various languages. These features can be considered stable genre predictors across a set of languages. Our experiments show that selecting stable features yields significant accuracy gains over the full feature set, and that a small number of features can suffice to reliably distinguish between different genres.
Similar resources
The 5th Workshop on Building and Using Comparable Corpora
Classification of texts by genre can benefit applications in Natural Language Processing and Information Retrieval. However, a mono-lingual approach requires large amounts of labeled texts in the target language. Work reported here shows that the benefits of genre classification can be extended to other languages through cross-lingual methods. Comparable corpora – here taken to be collections o...
Comparable English-Russian Book Review Corpora for Sentiment Analysis
This paper presents newly designed comparable corpora of book reviews in two parts, Russian and English, representing two very different languages. The corpora are comparable in terms of domain, style and size. This set of corpora may be of use for cross-lingual experiments in document-level sentiment classification. We also present a brief description of the language- and domain-specif...
Label Propagation for Fine-Grained Cross-Lingual Genre Classification
Cross-lingual methods can bring the benefits of genre classification to languages which lack genre-annotated training data. However, prior work in this field has been evaluated on coarse genres only. To predict fine-grained genres across languages, we propose a label propagation method, which combines separate sets of features. The results are promising, as the approach outperforms most baselin...
An Efficient Cross-lingual Model for Sentence Classification Using Convolutional Neural Network
In this paper, we propose a cross-lingual convolutional neural network (CNN) model that is based on word and phrase embeddings learned from unlabeled data in two languages and dependency grammar. Compared to traditional machine translation (MT) based methods for cross-lingual sentence modeling, our model is much simpler and does not need parallel corpora or language-specific features. We only u...
Cross-Lingual Genre Classification for Closely Related Languages
Resource scarcity is a topic that is continually researched by the HLT community, especially in the South African context. We explore the possibility of leveraging existing resources to help facilitate the development of new resources for under-resourced languages by using cross-lingual classification methods. We investigate the application of an Afrikaans genre classification system on Dutch t...